115 research outputs found

PRESISTANT : data pre-processing assistant

Author: B Bilalli
M Hall
MA Munson
P Brazdil
P Nguyen
UM Fayyad
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2019

A concrete classification algorithm may perform differently on datasets with different characteristics, e.g., it might perform better on a dataset with continuous attributes rather than with categorical attributes, or the other way around. Typically, in order to improve the results, datasets need to be pre-processed. Taking into account all the possible pre-processing operators, there exists a staggeringly large number of alternatives, and non-experienced users become overwhelmed. Trial and error is not feasible in the presence of large amounts of data. We developed a method and tool, PRESISTANT, with the aim of answering the need for user assistance during data pre-processing. Leveraging ideas from meta-learning, PRESISTANT is capable of assisting the user by recommending pre-processing operators that ultimately improve the classification performance. The user selects a classification algorithm, from the ones considered, and then PRESISTANT proposes candidate transformations to improve the result of the analysis. In the demonstration, participants will experience, at first hand, how PRESISTANT easily and effectively ranks the pre-processing operators.

Peer Reviewed. Postprint (author's final draft).

Crossref

UPCommons. Portal del coneixement obert de la UPC

Identification of a 5-Protein Biomarker Molecular Signature for Predicting Alzheimer's Disease

Author: A Mendes
C Cotta
CE Finch
H Bruunsgaard
I Ariadne Genomics
IH Witten
Joseph El Khoury
L Niels
Martín Gómez Ravetti
P Moscato
Pablo Moscato
R Berretta
S Magaki
S Ray
UM Fayyad
Publication venue: Public Library of Science
Publication date: 01/01/2008

Background: Alzheimer’s disease (AD) is a progressive brain disease with a huge cost to human lives. The impact of the disease is also a growing concern for the governments of developing countries, in particular due to the increasingly high number of elderly citizens at risk. Alzheimer’s is the most common form of dementia, a common term for memory loss and other cognitive impairments. There is no current cure for AD, but there are drug and non-drug based approaches for its treatment. In general, the drug treatments are directed at slowing the progression of symptoms. They have proved to be effective in a large group of patients, but success is directly correlated with identifying the disease carriers at its early stages. This justifies the need for timely and accurate forms of diagnosis via molecular means. We report here a 5-protein biomarker molecular signature that achieves, on average, a 96% total accuracy in predicting clinical AD. The signature is composed of the abundances of IL-1α, IL-3, EGF, TNF-α and G-CSF. Methodology/Principal Findings: Our results are based on a recent molecular dataset that has attracted worldwide attention. Our paper illustrates that improved results can be obtained with the abundance of only five proteins. Our methodology consisted of the application of an integrative data analysis method. This four-step process included: a) abundance quantization, b) feature selection, c) literature analysis, d) selection of a classifier algorithm which is independent of the feature selection process. These steps were performed without using any sample of the test datasets. For the first two steps, we used the application of Fayyad and Irani’s discretization algorithm for selection and quantization, which in turn creates an instance of the (alpha-beta)-k-Feature Set problem; a numerical solution of this problem led to the selection of only 10 proteins.
Conclusions/Significance: The previous study has provided an extremely useful dataset for the identification of AD biomarkers. However, our subsequent analysis also revealed several important facts worth reporting: 1. A 5-protein signature (which is a subset of the 18-protein signature of Ray et al.) has the same overall performance (when using the same classifier). 2. Using more than 20 different classifiers available in the widely-used Weka software package, our 5-protein signature has, on average, a smaller prediction error, indicating the independence of the classifier and the robustness of this set of biomarkers (i.e. 96% accuracy when predicting AD against non-demented control). 3. Using very simple classifiers, like Simple Logistic or Logistic Model Trees, we have achieved the following results on 92 samples: 100 percent success in predicting Alzheimer’s Disease and 92 percent in predicting Non Demented Control on the AD dataset.

University of Newcastle's Digital Repository

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

A bioinformatics knowledge discovery in text application for grid computing

Author: A Hotho
AM Cohen
D Talia
EG Talbi
Gianfranco Tarricone
Giuseppe Mastronardi
H Shatkay
I Foster
IH Witten
M Castellano
Marcello Castellano
P Zweigenbaum
PC Carvalho
R Mooney
RC Bunescu
Roberto Bellotti
U Leser
UM Fayyad
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: A fundamental activity in biomedical research is Knowledge Discovery, which has the ability to search through large amounts of biomedical information such as documents and data. High performance computational infrastructures, such as Grid technologies, are emerging as a possible infrastructure to tackle the intensive use of Information and Communication resources in life science. The goal of this work was to develop a software middleware solution in order to exploit the many knowledge discovery applications on scalable and distributed computing systems to achieve intensive use of ICT resources. Methods: The development of a grid application for Knowledge Discovery in Text using a middleware solution based methodology is presented. The system must be able to perform a user application model and process the jobs with the aim of creating many parallel jobs to distribute on the computational nodes. Finally, the system must be aware of the computational resources available and their status, and must be able to monitor the execution of parallel jobs. These operative requirements led to the design of a middleware to be specialized using user application modules. It included a graphical user interface in order to access a node search system, a load balancing system and a transfer optimizer to reduce communication costs. Results: A middleware solution prototype and its performance evaluation in terms of the speed-up factor are shown. It was written in Java on Globus Toolkit 4 to build the grid infrastructure based on GNU/Linux computer grid nodes. A test was carried out and the results are shown for the named entity recognition search of symptoms and pathologies. The search was applied to a collection of 5,000 scientific documents taken from PubMed. Conclusion: In this paper we discuss the development of a grid application based on a middleware solution. It has been tested on a knowledge discovery in text process to extract new and useful information about symptoms and pathologies from a large collection of unstructured scientific documents. As an example, a computation of Knowledge Discovery in Database was applied on the output produced by the KDT user module to extract new knowledge about symptom and pathology bio-entities.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Archivio istituzionale della ricerca - Università di Bari

Big data: Finders keepers, losers weepers?

Author: A McAfee
B Roessler
CJ Hoofnagle
DJ Solove
H Chen
IM Kirzner
IS Rubinstein
J Turow
K Crawford
K Michael
L Floridi
Marijn Sax
NM Richards
O Tene
R Nozick
UM Fayyad
V Mayer-Schönberger
Publication venue: Springer Science and Business Media LLC

Crossref

Knowledge discOvery And daTa minINg inteGrated (KOATING) Moderators for collaborative projects

Author: A.K. Choudhary
Fayyad UM
H.K. Lin
Harding JA
Ikujiro N
J.A. Harding
M.K. Tiwari
Pechoucek M
Piatetsky-Shapiro G
R. Shankar
Shearer C
Publication venue: Informa UK Limited

Crossref

Statistics and computing: the genesis of data science

Author: David J. Hand
DJ Hand
HG Wells
JA Nelder
JPA Ioannidis
UM Fayyad
W Kruskal
Publication venue: Springer Science and Business Media LLC

Crossref

Guest Editorial: Global modeling using local patterns

Author: A Zimmermann
AJ Knobbe
Arno Knobbe
B Bringmann
B Goethals
DM Blei
G Forman
I Guyon
JH Friedman
Johannes Fürnkranz
P Kralj Novak
S Kramer
SM Weiss
UM Fayyad
Publication venue: Springer Science and Business Media LLC

Crossref

Partitioning clustering algorithms for protein sequence data sets

Author: A Enright
A Herger
A Krause
DW Mount
E Bolten
E Kriventseva
F Can
G Yona
H Cathy
H Spath
J Hartigan
J Shi
KJ Anil
L Kaufman
Mohamed Limam
N Essoussi
Nadia Essoussi
O Sasson
P Cabena
P Clote
P Pipenbacher
P Sperisen
R Ng
R Tatusov
RC Dubes
S Altschul
S Henikoff
S Schneckener
S Van Dongen
SB Needleman
SE Brenner
Sondes Fayech
TF Smith
UM Fayyad
V Faber
V Guralnik
WR Pearson
Z Wu
Publication venue: BioMed Central
Publication date: 01/04/2009

Background: Genome-sequencing projects are currently producing an enormous amount of new sequences, causing protein sequence databases to grow rapidly. The unsupervised classification of these data into functional groups or families, clustering, has become one of the principal research objectives in structural and functional genomics. Computer programs that automatically and accurately classify sequences into families have become a necessity. A significant number of methods have addressed the clustering of protein sequences, and most of them can be categorized into three major groups: hierarchical, graph-based and partitioning methods. Among the various sequence clustering methods in the literature, hierarchical and graph-based approaches have been widely used. Although partitioning clustering techniques are widely used in other fields, few applications have been found in the field of protein sequence clustering. It has not been fully demonstrated whether partitioning methods can be applied to protein sequence data and whether these methods can be efficient compared to the published clustering methods. Methods: We developed four partitioning clustering approaches using the Smith-Waterman local-alignment algorithm to determine pair-wise similarities of sequences. Four different sets of protein sequences were used as evaluation data sets for the proposed methods. Results: We show that these methods outperform several other published clustering methods in terms of correctly predicting a classifier and especially in terms of the correctness of the provided prediction. The software is available to academic users from the authors upon request.

Crossref

Directory of Open Access Journals

PubMed Central

Occupancy Classification of Position Weight Matrix-Inferred Transcription Factor Binding Sites

Author: A Barski
A Valouev
Aaron Cohen
B Lenhard
CC Chang
D Karolchik
DH Wolpert
DL Daniels
E Roulet
FN Jensen
G Cooper
G Pavesi
G Robertson
GC Prendergast
GD Stormo
Gregory Yochum
Hollis Wright
IH Witten
Indra Neil Sarkar
J Cohen
JE Darnell
Kemal Sönmez
KI Zeller
KJ Won
M Tompa
MA Hall
N Friedman
ND Heintzman
OJ Sansom
P Hatzis
PJ Collins
Q Sun
R Staden
S Cawley
S Sinha
Shannon McWeeney
SL Schreiber
TL Bailey
TY Roh
UM Fayyad
V Matys
VN Vapnik
Y Chen
YJ Shann
Publication venue: Public Library of Science
Publication date: 04/11/2011

BACKGROUND: Computational prediction of Transcription Factor Binding Sites (TFBS) from sequence data alone is difficult and error-prone. Machine learning techniques utilizing additional environmental information about a predicted binding site (such as distances from the site to particular chromatin features) to determine its occupancy/functionality class show promise as methods to achieve more accurate prediction of true TFBS in silico. We evaluate the Bayesian Network (BN) and Support Vector Machine (SVM) machine learning techniques on four distinct TFBS data sets and analyze their performance. We describe the features that are most useful for classification and contrast and compare these feature sets between the factors. RESULTS: Our results demonstrate good performance of classifiers both on TFBS for transcription factors used for initial training and on TFBS for other factors in cross-classification experiments. We find distances to chromatin modifications (specifically, histone modification islands), as well as distances between such modifications, to be effective predictors of TFBS occupancy, though the impact of individual predictors is largely TF specific. In our experiments, Bayesian network classifiers outperform SVM classifiers. CONCLUSIONS: Our results demonstrate good performance of machine learning techniques on the problem of occupancy classification, and demonstrate that effective classification can be achieved using distances to chromatin features. We additionally demonstrate that cross-classification of TFBS is possible, suggesting the possibility of constructing a generalizable occupancy classifier capable of handling TFBS for many different transcription factors.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

A clustering comparison measure using density profiles and its application to the discovery of alternate clusterings

Author: A Strehl
A Topchy
ALN Fred
B Mirkin
DL Wallace
Eric Bae
G Ekman
Guozhu Dong
HW Kuhn
I Borg
J Dunn
James Bailey
JC Gower
L Hamers
L Hubert
RAM Gregson
S Theodoridis
UM Fayyad
V Estivill-Castro
W Rand
WM Rand
Y Yang
Publication venue: Springer Science and Business Media LLC

Crossref